3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.

"An attention function can be described as a mapping from a query and a set of key-value pairs to an output."

"The query, keys, values, and output are all vectors."

The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

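"The output is computed as a weighted sum of the values."

"The weight assigned to each value is computed by a compatibility function between the query and the corresponding key."

(Is this saying that the closer the query and the key are as vectors, the larger the weight?)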

3.2.1 Scaled Dot-Product Attention

Figure 2, left side

(So the mask is the one mentioned for the decoder in 3.1!)

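(Is it called "Scaled" because it works on matrices? More likely the name refers to the 1/√d_k scaling factor noted later in this section.)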

Input

Queries and keys have dimension d_k

Values have dimension d_v

We compute the dot products of the query with all keys, divide each by √d_k, and apply a softmax function to obtain the weights on the values.

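"Compute the dot products (= inner products) of the query with all keys, divide each by √d_k (the square root of the key dimension), and apply a softmax function to obtain the weights on the values."

This describes the part of Figure 2 (left side) that takes Q and K as input (the MatMul between the weights and the values is also implied)

"In practice, the attention function is computed on a set of queries simultaneously, packed together into a matrix Q."

The keys and values are likewise packed into matrices K and V

Equation (1): the attention function taking the matrices Q, K, V as arguments: Attention(Q, K, V) = softmax(Q K^T / √d_k) V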
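Equation (1) can be sketched in plain Python to make the steps concrete. This is a minimal illustration, not the paper's implementation: the helper names (`softmax`, `matmul`, `transpose`, `scaled_dot_product_attention`) and the toy values of Q, K, V are made up for this note, and matrices are plain lists of rows.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, with the softmax taken per query (row)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))  # one row of scores per query
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy inputs (hypothetical): 2 queries, 3 key-value pairs, d_k = 2, d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights)  # each row sums to 1
print(output)   # one d_v-dimensional vector per query
```

In this toy case the first query has a larger dot product with the first and third keys than with the second, so those two values receive the larger weights. That is the sense in which the dot product acts as the compatibility function.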

The two most commonly used attention functions are additive attention, and dot-product (multiplicative) attention.

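"The two most widely used attention functions are additive attention and dot-product (multiplicative) attention."

The reference given for additive attention is "Neural Machine Translation by Jointly Learning to Align and Translate"

The text goes on to say that dot-product attention is identical to this paper's algorithm apart from the scaling factor

The scaling factor 1/√d_k is this paper's addition

Compared with additive attention, dot-product attention is "much faster and more space-efficient in practice"

The final paragraph explains 1/√d_k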

We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients.

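"The authors suspect that for large d_k the dot products grow large in magnitude, which pushes the softmax function into regions where its gradients are extremely small."

See also footnote 4

The 1/√d_k factor is there to counteract this effect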
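Footnote 4 gives the reasoning: if the components of q and k are independent random variables with mean 0 and variance 1, their dot product has mean 0 and variance d_k, so typical scores grow like √d_k. The sketch below illustrates the consequence with hand-picked scores rather than the paper's data. The values in `raw_scores` are hypothetical, chosen so that their spread is on the order of √d_k for d_k = 64.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


d_k = 64
# Hypothetical raw dot products with a spread of about sqrt(d_k) = 8.
raw_scores = [8.0, 0.0, -8.0]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

p_raw = softmax(raw_scores)
p_scaled = softmax(scaled_scores)

# Diagonal of the softmax Jacobian: d p_i / d z_i = p_i * (1 - p_i).
grad_raw = [p * (1.0 - p) for p in p_raw]
grad_scaled = [p * (1.0 - p) for p in p_scaled]

print("unscaled: max weight", max(p_raw), "max gradient", max(grad_raw))
print("scaled:   max weight", max(p_scaled), "max gradient", max(grad_scaled))
```

Without scaling, almost all of the weight lands on one value and the gradients are close to zero. After dividing by √d_k the distribution is softer and the gradients are much larger.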

3.2.2 Multi-Head Attention

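Figure 2, right side

h Scaled Dot-Product Attention units are used in parallel

(Besides computational efficiency, this also seems to realize the "looking from different viewpoints" idea that came up at the study group)

There are also h Linear (linear projection) layers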

Instead of performing a single attention function with d_model-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively.

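"Instead of applying a single attention function to d_model-dimensional keys, values, and queries, the authors found it beneficial to linearly project the queries, keys, and values h times, using different learned linear projections to d_k, d_k, and d_v dimensions respectively."

The text continues: the attention function is applied in parallel to each of the projected versions of the queries, keys, and values, yielding d_v-dimensional output values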

These are concatenated and once again projected, resulting in the final values.

"These are concatenated and projected once more to give the final values."

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

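"Multi-head attention lets the model jointly attend to information from different representation subspaces at different positions."

Equation for MultiHead(Q, K, V)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O

W_O is a parameter matrix: (h d_v, d_model)

(After passing through Multi-Head Attention, h outputs of dimension d_v are concatenated)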

where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Parameter matrices

W_i^Q and W_i^K: (d_model, d_k)

W_i^V: (d_model, d_v)
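The MultiHead equation and the parameter shapes above can be traced in a small plain-Python sketch. The sizes are deliberately tiny (d_model = 4, h = 2) and differ from the paper's. The weight matrices are random stand-ins for learned parameters, and every helper name (`matmul`, `softmax`, `attention`, `multi_head`, `rand_matrix`) is made up for this note.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head(q, k, v, w_q, w_k, w_v, w_o):
    """Concat(head_1, ..., head_h) W_O with head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)."""
    heads = [
        attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
        for wq, wk, wv in zip(w_q, w_k, w_v)
    ]
    # Concatenate the h head outputs along the feature axis, position by position.
    concat = [sum((head[pos] for head in heads), []) for pos in range(len(q))]
    return matmul(concat, w_o)


rng = random.Random(0)


def rand_matrix(rows, cols):
    """Random stand-in for a learned parameter matrix (illustration only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


d_model, h = 4, 2
d_k = d_v = d_model // h
W_Q = [rand_matrix(d_model, d_k) for _ in range(h)]
W_K = [rand_matrix(d_model, d_k) for _ in range(h)]
W_V = [rand_matrix(d_model, d_v) for _ in range(h)]
W_O = rand_matrix(h * d_v, d_model)

X = rand_matrix(3, d_model)  # 3 positions, used as Q, K and V
out = multi_head(X, X, X, W_Q, W_K, W_V, W_O)
print(len(out), len(out[0]))  # number of positions, d_model
```

Each head works in its own d_k-dimensional projection. Concatenating the h heads gives h * d_v features per position, and W_O maps them back to d_model.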

This paper uses h = 8 (8 heads)

For each of these we use d_k = d_v = d_model/h = 64.

"For each of these, d_k = d_v = d_model/h = 512/8 = 64"

Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

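"Because each head has a reduced dimension, the total computational cost is similar to that of single-head attention with full dimensionality."

(Dividing the dimension by the number of heads also means the concatenated result lines up with d_model. It also keeps the computational cost from growing too large)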
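A rough count with the paper's numbers (d_model = 512, h = 8) makes the cost remark concrete. The comparison baseline is an assumption of this note: a single head with d_k = d_v = d_model and the same four projections, with bias terms and the softmax itself ignored. The sequence length `n` is hypothetical.

```python
d_model, h = 512, 8
d_k = d_v = d_model // h  # 64

# Shapes of the per-head projections and the output projection.
shape_w_q = (d_model, d_k)
shape_w_k = (d_model, d_k)
shape_w_v = (d_model, d_v)
shape_w_o = (h * d_v, d_model)


def numel(shape):
    """Number of entries in a matrix of the given shape."""
    return shape[0] * shape[1]


# Projection parameters for h heads plus the output projection.
multi_head_params = h * (numel(shape_w_q) + numel(shape_w_k) + numel(shape_w_v)) + numel(shape_w_o)

# Assumed baseline: one head with full dimensionality and four d_model x d_model projections.
single_head_params = 4 * d_model * d_model

# Multiply-adds for the Q K^T scores over n positions (n is hypothetical).
n = 10
multi_head_score_ops = h * n * n * d_k
single_head_score_ops = n * n * d_model

print(h * d_v == d_model)  # the concatenated heads line up with d_model
print(multi_head_params, single_head_params)
print(multi_head_score_ops, single_head_score_ops)
```

Under these assumptions the two counts coincide, which is consistent with the paper's statement that the total cost is similar.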

3.2.3 Applications of Attention in our Model

Multi-head attention is used in three different ways

In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.

"In the 'encoder-decoder attention' layers, the queries come from the previous decoder layer, and the memory (?) keys and values come from the output of the encoder."

This allows every position in the decoder to attend over all positions in the input sequence.

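"Every position in the decoder can attend over all positions in the input sequence."

(I think Shibata-san said this in the NLP2022 tutorial)

This is said to mimic the typical encoder-decoder attention mechanisms in sequence-to-sequence models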

The encoder contains self-attention layers.

"The encoder contains self-attention layers."

In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder.

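"In a self-attention layer, the keys, values, and queries all come from the same place."

"In this case, that place is the output of the previous layer in the encoder."

(Previous layer: the encoder is a stack of N = 6 layers)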

Each position in the encoder can attend to all positions in the previous layer of the encoder.

"Each position in the encoder can attend to all positions in the previous encoder layer."

Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position.

"Similarly, the self-attention layers in the decoder let each position in the decoder attend to all positions in the decoder up to and including that position."

We need to prevent leftward information flow in the decoder to preserve the auto-regressive property.

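"To preserve the auto-regressive property, leftward information flow in the decoder must be prevented."

(Presumably this means the direction from the end of the sentence toward the beginning)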

We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.

"This is implemented inside scaled dot-product attention by masking out (setting to −∞) all values in the softmax input that correspond to illegal connections."
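The masking step can be sketched in plain Python. This assumes that the "illegal connections" are the entries where a position would attend to a later position (column index greater than row index). The score values and the helper name `causal_mask_scores` are made up for this note.

```python
import math


def softmax(xs):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_mask_scores(scores):
    """Set scores[i][j] to -inf wherever j > i (position i looking at a later position)."""
    return [
        [s if j <= i else float("-inf") for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]


# Hypothetical scaled scores for 3 decoder positions (rows: queries, columns: keys).
scores = [
    [0.5, 1.0, -0.3],
    [0.2, 0.1, 0.9],
    [1.5, -0.7, 0.4],
]

masked = causal_mask_scores(scores)
weights = [softmax(row) for row in masked]
for row in weights:
    print(row)
```

Because exp(−∞) is 0, the masked entries receive exactly zero weight, so each position only mixes values from itself and earlier positions.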